Gaussian Processes (GP) — Regression & Probabilistic Classification#

Gaussian Processes (GPs) are a Bayesian, nonparametric way to do supervised learning.

  • GP regression: returns a distribution over functions that fit the data.

  • GP classification: returns probabilities (via a latent GP + a link function).

A GP is best thought of as:

A probability distribution over plausible functions.


Why people like GPs#

Advantages:

  • Interpolates the observations (for many kernels when the noise is very small).

  • Probabilistic predictions: you get uncertainty estimates (e.g., 95% intervals).

  • Versatile kernels: you can encode assumptions (smoothness, periodicity, roughness).

Disadvantages:

  • Not sparse by default: it uses all data at prediction time.

  • Training can be expensive: typically \(\mathcal{O}(n^3)\) time and \(\mathcal{O}(n^2)\) memory.

  • Can struggle in very high-dimensional feature spaces without careful kernel design / scaling.

This notebook focuses on intuition + statistical background (Gaussian conditioning), uses Plotly visuals heavily, implements GP regression from scratch, and then connects the results to scikit-learn for both regression and classification.


Learning goals#

By the end you should be able to:

  • explain what a GP is (distribution over functions) and what a kernel means

  • sample functions from a GP prior and interpret how kernels change function shapes

  • derive GP regression by conditioning a multivariate Gaussian

  • implement GP regression in NumPy (Cholesky-based)

  • interpret the posterior mean and posterior variance

  • understand log marginal likelihood and why it balances fit vs complexity

  • use GaussianProcessRegressor and GaussianProcessClassifier in scikit-learn and explain key parameters


Notation#

  • Inputs: \(X = [x_1,\dots,x_n]^T\) (shape \(n\times d\))

  • Latent function values: \(f = [f(x_1),\dots,f(x_n)]^T\)

  • Observations (regression): \(y = f + \varepsilon\), with \(\varepsilon \sim \mathcal{N}(0,\sigma^2 I)\)

  • Kernel (covariance function): \(k(x,z)\)

  • Kernel matrix: \(K(X,X)\) where \(K_{ij} = k(x_i, x_j)\)


Table of contents#

  1. The GP idea: a distribution over functions

  2. Kernels as similarity + shape priors (Plotly demos)

  3. GP regression via Gaussian conditioning (math + intuition)

  4. GP regression from scratch (NumPy)

  5. Hyperparameters + log marginal likelihood (Occam’s razor)

  6. scikit-learn GP regression + parameter map

  7. GP classification: latent GP + logistic link + Laplace (intuition)

  8. scikit-learn GP classification (probability maps)

  9. Practical tips, limitations, and scaling strategies

  10. Exercises + references

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from dataclasses import dataclass

from scipy.special import expit  # sigmoid

from sklearn.gaussian_process import GaussianProcessRegressor, GaussianProcessClassifier
from sklearn.gaussian_process.kernels import (
    RBF,
    Matern,
    ExpSineSquared,
    RationalQuadratic,
    WhiteKernel,
    ConstantKernel,
)
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

1) The GP idea: a distribution over functions#

A Gaussian Process is defined as:

\[ f(x) \sim \mathcal{GP}(m(x),\;k(x,x')) \]

meaning: for any finite set of inputs \(X=[x_1,\dots,x_n]\), the vector of function values

\[ f = [f(x_1),\dots,f(x_n)]^T \]

is multivariate Gaussian:

\[ f \sim \mathcal{N}(m, K)\quad \text{where}\quad m_i=m(x_i),\; K_{ij}=k(x_i,x_j). \]

Intuition:

  • The mean function \(m(x)\) is often set to 0 (a neutral prior).

  • The kernel \(k(x,z)\) says how much you expect two points to co-vary.

  • Sampling from a GP prior gives you random functions consistent with those assumptions.
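The finite-marginal definition above can be checked empirically: the sample covariance of many draws of \(f\) at a fixed set of inputs should approach \(K(X,X)\). A minimal sketch (assuming an RBF kernel with unit length-scale and variance, not one of the kernels defined below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Evaluate the prior at three fixed inputs and compare the empirical
# covariance of the samples against the kernel matrix K(X, X).
X = np.array([[-1.0], [0.0], [2.0]])
sq = (X - X.T) ** 2
K = np.exp(-0.5 * sq)  # RBF kernel, length_scale = variance = 1 (assumed)

samples = rng.multivariate_normal(np.zeros(3), K, size=200_000)
K_hat = np.cov(samples, rowvar=False)
print(np.abs(K_hat - K).max())  # close to 0 for large sample sizes
```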

def _as_2d(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        return X.reshape(-1, 1)
    return X


def rbf_kernel(
    X: np.ndarray,
    Z: np.ndarray,
    length_scale: float = 1.0,
    variance: float = 1.0,
) -> np.ndarray:
    X = _as_2d(X)
    Z = _as_2d(Z)
    length_scale = float(length_scale)
    variance = float(variance)

    X_norm = np.sum(X**2, axis=1)[:, None]
    Z_norm = np.sum(Z**2, axis=1)[None, :]
    dist2 = X_norm + Z_norm - 2.0 * (X @ Z.T)
    return variance * np.exp(-0.5 * dist2 / (length_scale**2))


def periodic_kernel(
    X: np.ndarray,
    Z: np.ndarray,
    length_scale: float = 1.0,
    period: float = 1.0,
    variance: float = 1.0,
) -> np.ndarray:
    X = _as_2d(X)
    Z = _as_2d(Z)
    length_scale = float(length_scale)
    period = float(period)
    variance = float(variance)

    # For 1D this is just |x-z|; for d>1 you could generalize with norms.
    d = np.abs(X - Z.T)
    return variance * np.exp(-2.0 * (np.sin(np.pi * d / period) ** 2) / (length_scale**2))


def sample_gp_prior_1d(
    x_grid: np.ndarray,
    kernel_fn,
    n_samples: int = 5,
    jitter: float = 1e-9,
) -> np.ndarray:
    X = _as_2d(x_grid)
    K = kernel_fn(X, X) + jitter * np.eye(X.shape[0])
    return rng.multivariate_normal(mean=np.zeros(X.shape[0]), cov=K, size=n_samples)


def plot_gp_samples_1d(
    x: np.ndarray,
    samples: np.ndarray,
    title: str,
    y_mean: np.ndarray | None = None,
    y_std: np.ndarray | None = None,
    y_train: np.ndarray | None = None,
    x_train: np.ndarray | None = None,
) -> go.Figure:
    fig = go.Figure()

    if y_mean is not None and y_std is not None:
        y_mean = np.asarray(y_mean)
        y_std = np.asarray(y_std)

        upper = y_mean + 1.96 * y_std
        lower = y_mean - 1.96 * y_std

        fig.add_trace(go.Scatter(x=x, y=upper, mode="lines", line=dict(width=0), showlegend=False))
        fig.add_trace(
            go.Scatter(
                x=x,
                y=lower,
                mode="lines",
                line=dict(width=0),
                fill="tonexty",
                fillcolor="rgba(31, 119, 180, 0.18)",
                name="95% interval",
            )
        )
        fig.add_trace(go.Scatter(x=x, y=y_mean, mode="lines", name="mean", line=dict(width=3)))

    for i, s in enumerate(samples):
        fig.add_trace(
            go.Scatter(
                x=x,
                y=s,
                mode="lines",
                line=dict(width=1),
                name=f"sample {i+1}",
                opacity=0.7,
                showlegend=i == 0,
            )
        )

    if x_train is not None and y_train is not None:
        fig.add_trace(
            go.Scatter(
                x=np.asarray(x_train),
                y=np.asarray(y_train),
                mode="markers",
                marker=dict(size=8, color="black"),
                name="observations",
            )
        )

    fig.update_layout(title=title, xaxis_title="x", yaxis_title="f(x)")
    return fig
x = np.linspace(-5, 5, 220)

kernels = {
    "RBF (l=0.4)": lambda X, Z: rbf_kernel(X, Z, length_scale=0.4, variance=1.0),
    "RBF (l=1.5)": lambda X, Z: rbf_kernel(X, Z, length_scale=1.5, variance=1.0),
    "Periodic (p=2.5)": lambda X, Z: periodic_kernel(X, Z, length_scale=0.6, period=2.5, variance=1.0),
}

fig = go.Figure()
for name, kfn in kernels.items():
    k_slice = kfn(x.reshape(-1, 1), np.array([[0.0]])).ravel()
    fig.add_trace(go.Scatter(x=x, y=k_slice, mode="lines", name=name))

fig.update_layout(
    title="Kernel slice k(x, 0): how correlation decays with distance",
    xaxis_title="x",
    yaxis_title="k(x, 0)",
)
fig.show()
x_grid = np.linspace(-4, 4, 160)

samples_rbf_short = sample_gp_prior_1d(x_grid, kernels["RBF (l=0.4)"], n_samples=6)
samples_rbf_long = sample_gp_prior_1d(x_grid, kernels["RBF (l=1.5)"], n_samples=6)
samples_per = sample_gp_prior_1d(x_grid, kernels["Periodic (p=2.5)"], n_samples=6)

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=3, subplot_titles=["RBF: short length-scale", "RBF: long length-scale", "Periodic kernel"])

for s in samples_rbf_short:
    fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=1)
for s in samples_rbf_long:
    fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=2)
for s in samples_per:
    fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=3)

fig.update_layout(title="Sampling random functions from GP priors (kernels = shape assumptions)")
fig.show()

2) Kernels as similarity + shape priors#

Kernels do two jobs at once:

  1. Similarity: if \(k(x,z)\) is large, the model believes \(f(x)\) and \(f(z)\) should be similar.

  2. Shape prior: the kernel determines what kind of functions are likely (smooth, wiggly, periodic, etc.).

One useful mental picture is the kernel matrix \(K(X,X)\):

  • If it has large off-diagonal values, points are strongly coupled → smoother functions.

  • If it is close to diagonal, points are weakly coupled → rougher / more local behavior.

Xk = x_grid.reshape(-1, 1)
K_short = kernels["RBF (l=0.4)"](Xk, Xk)
K_long = kernels["RBF (l=1.5)"](Xk, Xk)

fig = make_subplots(rows=1, cols=2, subplot_titles=["RBF kernel matrix (short l)", "RBF kernel matrix (long l)"])
fig.add_trace(go.Heatmap(z=K_short, colorscale="Viridis", showscale=False), row=1, col=1)
fig.add_trace(go.Heatmap(z=K_long, colorscale="Viridis", showscale=True), row=1, col=2)

fig.update_layout(title="Kernel matrices: longer length-scale couples points more strongly")
fig.show()

3) GP regression via Gaussian conditioning#

The core mathematical trick in GP regression is conditioning a joint Gaussian.

Assume a zero-mean prior (for simplicity):

\[ f \sim \mathcal{N}(0, K(X,X)). \]

And a Gaussian noise model:

\[ y = f + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I). \]

So:

\[ y \sim \mathcal{N}(0,\;K(X,X) + \sigma^2 I). \]

For test points \(X_*\), the joint distribution of training outputs \(y\) and latent test function values \(f_*\) is Gaussian:

\[\begin{split} \begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K(X,X)+\sigma^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix} \right). \end{split}\]

Conditioning a Gaussian gives the posterior:

\[ \mu_* = K(X_*,X)\,[K(X,X)+\sigma^2 I]^{-1} y \]
\[ \Sigma_* = K(X_*,X_*) - K(X_*,X)\,[K(X,X)+\sigma^2 I]^{-1}K(X,X_*). \]

Interpretation:

  • The posterior mean is a kernel-weighted combination of observed outputs.

  • The posterior variance shrinks near observed points and grows far away.

def gp_posterior_regression(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray,
    kernel_fn,
    noise_variance: float = 1e-4,
    jitter: float = 1e-9,
) -> tuple[np.ndarray, np.ndarray]:
    X_train = _as_2d(X_train)
    X_test = _as_2d(X_test)
    y_train = np.asarray(y_train, dtype=float).ravel()

    n = X_train.shape[0]

    K = kernel_fn(X_train, X_train) + (noise_variance + jitter) * np.eye(n)
    L = np.linalg.cholesky(K)

    # alpha = K^{-1} y via two triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))

    K_s = kernel_fn(X_train, X_test)  # n x m
    mu = K_s.T @ alpha               # m

    v = np.linalg.solve(L, K_s)      # n x m
    K_ss = kernel_fn(X_test, X_test)
    cov = K_ss - v.T @ v

    return mu, cov


def log_marginal_likelihood_zero_mean(
    X_train: np.ndarray,
    y_train: np.ndarray,
    kernel_fn,
    noise_variance: float,
    jitter: float = 1e-9,
) -> float:
    X_train = _as_2d(X_train)
    y_train = np.asarray(y_train, dtype=float).ravel()
    n = X_train.shape[0]

    K = kernel_fn(X_train, X_train) + (noise_variance + jitter) * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))

    log_det = 2.0 * float(np.sum(np.log(np.diag(L))))
    quad = float(y_train @ alpha)
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
# Synthetic regression: noisy samples from a smooth function
def f_true(x: np.ndarray) -> np.ndarray:
    return np.sin(0.8 * x) + 0.25 * np.cos(2.0 * x)


n_train = 18
X_train = np.sort(rng.uniform(-4.5, 4.5, size=n_train))
sigma = 0.15
y_train = f_true(X_train) + sigma * rng.normal(size=n_train)

X_test = np.linspace(-6, 6, 400)

k = lambda X, Z: rbf_kernel(X, Z, length_scale=1.2, variance=1.0)
mu, cov = gp_posterior_regression(X_train, y_train, X_test, k, noise_variance=sigma**2)
std = np.sqrt(np.clip(np.diag(cov), 0.0, np.inf))

posterior_samples = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_test)), size=5)

fig = plot_gp_samples_1d(
    X_test,
    posterior_samples,
    title="GP regression posterior (RBF kernel): mean + uncertainty + samples",
    y_mean=mu,
    y_std=std,
    x_train=X_train,
    y_train=y_train,
)
fig.add_trace(go.Scatter(x=X_test, y=f_true(X_test), mode="lines", line=dict(dash="dot"), name="true f(x)", opacity=0.7))
fig.show()
# Interpolation vs smoothing: change the assumed noise
noise_levels = [1e-6, sigma**2, 0.5**2]
titles = ["almost noiseless (interpolates)", "matched noise", "high noise (smoother)"]

fig = make_subplots(rows=1, cols=3, subplot_titles=titles)

for col, nv in enumerate(noise_levels, start=1):
    mu_c, cov_c = gp_posterior_regression(X_train, y_train, X_test, k, noise_variance=nv)
    std_c = np.sqrt(np.clip(np.diag(cov_c), 0.0, np.inf))

    fig.add_trace(go.Scatter(x=X_test, y=mu_c + 1.96 * std_c, mode="lines", line=dict(width=0), showlegend=False), row=1, col=col)
    fig.add_trace(
        go.Scatter(
            x=X_test,
            y=mu_c - 1.96 * std_c,
            mode="lines",
            line=dict(width=0),
            fill="tonexty",
            fillcolor="rgba(31, 119, 180, 0.18)",
            showlegend=False,
        ),
        row=1,
        col=col,
    )
    fig.add_trace(go.Scatter(x=X_test, y=mu_c, mode="lines", line=dict(width=3), showlegend=False), row=1, col=col)
    fig.add_trace(go.Scatter(x=X_train, y=y_train, mode="markers", marker=dict(size=6, color="black"), showlegend=False), row=1, col=col)

fig.update_layout(title="Noise controls interpolation vs smoothing")
fig.show()

4) GP regression from scratch: what the equations are doing#

The posterior mean can be rewritten as:

\[ \mu_*(x_*) = k(x_*,X)\,[K(X,X)+\sigma^2 I]^{-1} y. \]

Think of it like this:

  • \(k(x_*,X)\) measures similarity between \(x_*\) and each training point.

  • \([K+\sigma^2 I]^{-1}y\) produces weights that account for redundancy between training points.

So GP regression isn’t just “weighted averaging” — it’s weighted averaging corrected for correlations.

Also notice the computation bottleneck: factoring/inverting the \(n\times n\) matrix \(K+\sigma^2 I\).

That’s why we used a Cholesky factorization above.
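A quick illustration of why: on a symmetric positive-definite system, the Cholesky route gives the same \(\alpha = (K+\sigma^2 I)^{-1}y\) as an explicit inverse, at the same \(\mathcal{O}(n^3)\) cost but with better numerical stability (a sketch with a random SPD matrix standing in for \(K+\sigma^2 I\), not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small SPD system standing in for K + sigma^2 I.
A = rng.normal(size=(50, 50))
K_y = A @ A.T + 50 * np.eye(50)  # well-conditioned SPD matrix
y = rng.normal(size=50)

# Naive route: explicit inverse (numerically worse in general).
alpha_inv = np.linalg.inv(K_y) @ y

# Cholesky route used above: K_y = L L^T, then two triangular solves.
L = np.linalg.cholesky(K_y)
alpha_chol = np.linalg.solve(L.T, np.linalg.solve(L, y))

print(np.allclose(alpha_inv, alpha_chol))  # same alpha either way
```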

5) Hyperparameters + log marginal likelihood (Occam’s razor)#

Kernels have hyperparameters (length-scale, amplitude, periodicity, …). In GP regression we can fit them by maximizing the log marginal likelihood:

\[ \log p(y\mid X,\theta) =-\frac{1}{2}y^T K_y^{-1}y -\frac{1}{2}\log|K_y| -\frac{n}{2}\log(2\pi), \quad K_y = K(X,X;\theta) + \sigma^2 I. \]

Interpretation:

  • The quadratic term (\(y^T K_y^{-1}y\)) rewards data fit.

  • The log-determinant term (\(\log|K_y|\)) penalizes overly flexible models (it’s a built-in complexity penalty).

That balance is why people say GPs implement an “Occam’s razor” automatically.

# Visualize the log marginal likelihood over a small grid of (length_scale, noise)
length_scales = np.linspace(0.2, 2.8, 80)
noises = np.linspace(0.02, 0.6, 80)  # noise std

LML = np.zeros((len(noises), len(length_scales)))

for i, ns in enumerate(noises):
    for j, ls in enumerate(length_scales):
        k_ij = lambda X, Z, ls=ls: rbf_kernel(X, Z, length_scale=ls, variance=1.0)
        LML[i, j] = log_marginal_likelihood_zero_mean(X_train, y_train, k_ij, noise_variance=ns**2)

best = np.unravel_index(np.argmax(LML), LML.shape)
best_noise, best_ls = noises[best[0]], length_scales[best[1]]
print("best length_scale ~", float(best_ls), "best noise std ~", float(best_noise))

fig = px.imshow(
    LML,
    x=length_scales,
    y=noises,
    origin="lower",
    aspect="auto",
    color_continuous_scale="Viridis",
    title="Log marginal likelihood surface (RBF): fit vs complexity tradeoff",
)
fig.update_layout(xaxis_title="length_scale", yaxis_title="noise std")
fig.add_trace(
    go.Scatter(
        x=[best_ls],
        y=[best_noise],
        mode="markers",
        marker=dict(size=12, color="red", symbol="x"),
        name="max",
    )
)
fig.show()
best length_scale ~ 1.351898734177215 best noise std ~ 0.16683544303797468
# A quick "adaptive fitting" demo: pick the next point where uncertainty is largest.

def gp_std_on_grid(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, kernel_fn, noise_variance: float) -> np.ndarray:
    mu_, cov_ = gp_posterior_regression(X_train, y_train, X_test, kernel_fn, noise_variance=noise_variance)
    return np.sqrt(np.clip(np.diag(cov_), 0.0, np.inf))


X_train_a = X_train.copy()
y_train_a = y_train.copy()

std_before = gp_std_on_grid(X_train_a, y_train_a, X_test, k, noise_variance=sigma**2)
x_new = float(X_test[np.argmax(std_before)])
y_new = float(f_true(np.array([x_new]))[0] + sigma * rng.normal())

order = np.argsort(np.r_[X_train_a, x_new])
X_train_a = np.r_[X_train_a, x_new][order]
y_train_a = np.r_[y_train_a, y_new][order]

std_after = gp_std_on_grid(X_train_a, y_train_a, X_test, k, noise_variance=sigma**2)

fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=["Uncertainty before adding a new point", "Uncertainty after adding it"])
fig.add_trace(go.Scatter(x=X_test, y=std_before, mode="lines", name="std"), row=1, col=1)
fig.add_trace(go.Scatter(x=[x_new], y=[std_before.max()], mode="markers", marker=dict(size=10, color="red"), name="chosen x"), row=1, col=1)

fig.add_trace(go.Scatter(x=X_test, y=std_after, mode="lines", showlegend=False), row=2, col=1)
fig.add_trace(go.Scatter(x=[x_new], y=[std_after[np.argmax(std_before)]], mode="markers", marker=dict(size=10, color="red"), showlegend=False), row=2, col=1)

fig.update_layout(title="Adaptive sampling intuition: query where the GP is most uncertain")
fig.update_xaxes(title_text="x", row=2, col=1)
fig.update_yaxes(title_text="posterior std", row=1, col=1)
fig.update_yaxes(title_text="posterior std", row=2, col=1)
fig.show()

print("picked x_new=", x_new, "| y_new=", y_new)
picked x_new= 6.0 | y_new= -0.6056786127882988

6) scikit-learn GP regression + parameter map#

scikit-learn provides GaussianProcessRegressor (GPR). It optimizes kernel hyperparameters by maximizing the log marginal likelihood.

Key parameters:

  • kernel: the covariance function. If you include bounds, sklearn will optimize those parameters.

  • alpha: added to the diagonal during fitting.

    • can represent observation noise variance (or just numerical jitter)

  • normalize_y: center/scale the targets internally (often helpful)

  • n_restarts_optimizer: tries multiple initializations (helps avoid bad local optima)

  • random_state: reproducibility for the optimizer restarts

Common kernel pattern:

  • ConstantKernel sets overall amplitude

  • RBF (or Matern) sets smoothness/length-scale

  • WhiteKernel explicitly models noise

Xtr = X_train.reshape(-1, 1)
Xte = X_test.reshape(-1, 1)

kernel = ConstantKernel(1.0, (1e-2, 1e2)) * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2)) + WhiteKernel(
    noise_level=sigma**2,
    noise_level_bounds=(1e-6, 1e0),
)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    alpha=0.0,
    normalize_y=True,
    n_restarts_optimizer=5,
    random_state=42,
)
gpr.fit(Xtr, y_train)

print("Learned kernel:")
print(gpr.kernel_)
print("log marginal likelihood:", float(gpr.log_marginal_likelihood(gpr.kernel_.theta)))

mu_sk, std_sk = gpr.predict(Xte, return_std=True)

fig = go.Figure()
fig.add_trace(go.Scatter(x=X_test, y=mu_sk + 1.96 * std_sk, mode="lines", line=dict(width=0), showlegend=False))
fig.add_trace(
    go.Scatter(
        x=X_test,
        y=mu_sk - 1.96 * std_sk,
        mode="lines",
        line=dict(width=0),
        fill="tonexty",
        fillcolor="rgba(0, 150, 136, 0.18)",
        name="95% interval",
    )
)
fig.add_trace(go.Scatter(x=X_test, y=mu_sk, mode="lines", name="mean", line=dict(width=3, color="rgb(0,150,136)")))
fig.add_trace(go.Scatter(x=X_train, y=y_train, mode="markers", marker=dict(size=8, color="black"), name="observations"))
fig.add_trace(go.Scatter(x=X_test, y=f_true(X_test), mode="lines", line=dict(dash="dot"), name="true f(x)", opacity=0.7))
fig.update_layout(title="GaussianProcessRegressor (sklearn): posterior mean + 95% interval", xaxis_title="x", yaxis_title="y")
fig.show()
Learned kernel:
1.23**2 * RBF(length_scale=1.29) + WhiteKernel(noise_level=0.055)
log marginal likelihood: -12.687509884300587

8) Practical tips, limitations, and scaling strategies#

Practical tips#

  • Scale inputs: kernels are distance-based; feature scaling changes everything.

  • Start with RBF (very smooth) or Matern (rougher) and let the optimizer tune length-scales.

  • For regression, include a WhiteKernel (explicit noise) or use alpha.

  • Use n_restarts_optimizer if hyperparameters are sensitive.
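The first tip is easy to bake in with the imports already loaded above: put a `StandardScaler` in front of the GP so both features enter the kernel on comparable scales. A sketch on hypothetical toy data (the feature ranges and kernel choice here are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data with wildly different feature scales (hypothetical example).
X = np.c_[rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)]
y = np.sin(6 * X[:, 0]) + 0.001 * X[:, 1] + 0.05 * rng.normal(size=100)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
model = make_pipeline(
    StandardScaler(),  # rescales both features before the distance-based kernel
    GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))  # in-sample R^2
```

Without the scaler, the second feature's raw range (0 to 1000) would dominate the RBF distances and force an extreme length-scale.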

Complexity (why GPs don’t scale by default)#

  • Training typically needs a Cholesky of an \(n\times n\) matrix: \(\mathcal{O}(n^3)\).

  • Storing the kernel matrix costs \(\mathcal{O}(n^2)\) memory.

High-dimensional inputs#

In high dimensions, distances become less informative and kernels can behave poorly.

Common fixes:

  • dimensionality reduction / feature engineering

  • ARD kernels (separate length-scale per feature)

  • sparse/approximate GP methods (inducing points, random Fourier features)
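The last fix can be sketched in a few lines: random Fourier features (Rahimi & Recht) approximate an RBF kernel with an explicit finite feature map, so \(\phi(x)^\top\phi(z) \approx k(x,z)\) and downstream linear methods cost \(\mathcal{O}(nD^2)\) instead of \(\mathcal{O}(n^3)\). A minimal sketch (the function name and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, n_features=2000, length_scale=1.0, rng=rng):
    """Random Fourier features approximating an RBF kernel:
    phi(x)^T phi(z) ~= exp(-||x - z||^2 / (2 * length_scale^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / length_scale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(200, 3))
Phi = rff_features(X, n_features=2000, length_scale=1.2)
K_approx = Phi @ Phi.T

# Exact RBF kernel for comparison.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq / 1.2**2)

print(np.abs(K_approx - K_exact).max())  # shrinks as n_features grows
```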

Exercises#

  1. Build a GP regression model with a periodic kernel and fit data from a periodic function. How does the posterior behave outside the observed range?

  2. For regression, compare RBF vs Matern(nu=0.5, 1.5, 2.5). Which priors look most realistic for noisy data?

  3. On the moons dataset, sweep over length_scale (fix it instead of optimizing) and visualize under/over-fitting.

  4. Explain (in words) why the log marginal likelihood includes a \(\log|K_y|\) term and how it relates to model complexity.

References#

  • Rasmussen & Williams (2006): Gaussian Processes for Machine Learning (the classic)

  • scikit-learn docs: GaussianProcessRegressor, GaussianProcessClassifier, sklearn.gaussian_process.kernels